Ricardo J. Serrano
January 6, 2021
Supervised learning (target)
OLS (Ordinary Least Squares)
Simple, yet extremely useful and easy to explain
The library() function is used to load libraries, or groups of functions and data sets that are not included in the base R distribution. Basic functions that perform least squares linear regression and other simple analyses come standard with the base distribution, but more exotic functions require additional libraries. Here we load the MASS package, which is a very large collection of data sets and functions. We also load the ISLR2 package, which includes the data sets associated with this book.
Comment: Due to the recent controversies related to the housing dataset, we’ll use an alternate housing dataset
If you receive an error message when loading any of these libraries, it likely indicates that the corresponding library has not yet been installed on your system. Some libraries, such as MASS, come with R and do not need to be separately installed on your computer. However, other packages, such as ISLR2, must be downloaded the first time they are used. This can be done directly from within R. For example, on a Windows system, select the Install package option under the Packages tab. After you select any mirror site, a list of available packages will appear. Simply select the package you wish to install and R will automatically download the package. Alternatively, this can be done at the R command line via install.packages("ISLR2"). This installation only needs to be done the first time you use a package. However, the library() function must be called within each R session.
Melbourne Housing Market Kaggle Dataset
The dataset we’ll be using for the ISLR2 lab exercises is a subset, clean dataset from the original.
Features:
price: Price in Australian dollars
type: h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse
method: S - property sold; SP - property sold prior; PI - property passed in; VB - vendor bid; SA - sold after auction
council_area: Governing council for the area
distance: Number of distance
distance: Distance from CBD (Center Business District) in Kilometers
car: Number of carspots
year, month, day: extracted from Date - date sold
Lattitude: Self explanatory
Longtitude: Self explanatory
## # A tibble: 6 × 13
## price type method council_area rooms distance bathroom car lattitude
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1480000 h S Yarra City Counc… 2 2.5 1 1 -37.8
## 2 1035000 h S Yarra City Counc… 2 2.5 1 0 -37.8
## 3 1465000 h SP Yarra City Counc… 3 2.5 2 0 -37.8
## 4 850000 h PI Yarra City Counc… 3 2.5 2 1 -37.8
## 5 1600000 h VB Yarra City Counc… 4 2.5 1 2 -37.8
## 6 941000 h S Yarra City Counc… 2 2.5 1 0 -37.8
## # … with 4 more variables: longtitude <dbl>, year <dbl>, month <dbl>, day <dbl>
We will start by using the lm() function to fit a simple linear regression model, with price as the response and distance as the predictor. The basic syntax is {}, where y is the response, x is the predictor, and data is the data set in which these two variables are kept.
## Error in eval(predvars, data, env): object 'price' not found
The command causes an error because R does not know where to find the variables price and distance. The next line tells R that the variables are in housing. If we attach housing, the first line works fine because R now recognizes the variables.
If we type lm.fit, some basic information about the model is output. For more detailed information, we use summary(lm.fit). This gives us \(p\)-values and standard errors for the coefficients, as well as the \(R^2\) statistic and \(F\)-statistic for the model.
##
## Call:
## lm(formula = price ~ distance)
##
## Coefficients:
## (Intercept) distance
## 1343119 -22306
##
## Call:
## lm(formula = price ~ distance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1115361 -385750 -145449 234547 10091093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1343118.6 8445.4 159.03 <2e-16 ***
## distance -22305.8 635.7 -35.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 634700 on 20991 degrees of freedom
## Multiple R-squared: 0.05541, Adjusted R-squared: 0.05537
## F-statistic: 1231 on 1 and 20991 DF, p-value: < 2.2e-16
Using ggpubr library to visualize the regression model line, regression equation and \(R^2\)
library(ggpubr)
housing %>%
ggplot(aes(x = distance, y = price)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm") +
stat_regline_equation(label.x = 30, label.y = 8e+06) + # for regression equation
stat_cor(aes(label = ..rr.label..), label.x = 30, label.y = 7e+06) + # for R^2
theme_minimal()We can use the names() function in order to find out what other pieces of information are stored in lm.fit. Although we can extract these quantities by name—e.g. lm.fit$coefficients—it is safer to use the extractor functions like coef() to access them.
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
## (Intercept) distance
## 1343118.64 -22305.84
In order to obtain a confidence interval for the coefficient estimates, we can use the confint() command. %Type confint(lm.fit) at the command line to obtain the confidence intervals.
## 2.5 % 97.5 %
## (Intercept) 1326564.97 1359672.32
## distance -23551.79 -21059.89
The intercept value 1,343,118.64 is the expected price when distance is equal to 0. Let’s fit a model with a more interpretable intercept by centering our independent variable:
## (Intercept) I(distance - mean(distance))
## 1089746.18 -22305.84
The new intercept (1,089,746.18) is the expected price of the average distance (11.359 km). Notice the estimated slope didn’t change.
The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of price for a given value of distance.
## fit lwr upr
## 1 1120060.2 1111308.7 1128811.8
## 2 897001.9 883231.1 910772.6
## 3 673943.5 649181.4 698705.5
## fit lwr upr
## 1 1120060.2 -124018.3 2364139
## 2 897001.9 -347122.1 2141126
## 3 673943.5 -570350.7 1918238
For instance, the 95,% confidence interval associated with a distance value of 20 is \((883231.1, 910772.6)\), and the 95,% prediction interval is \((-347122.1, 2141126)\). As expected, the confidence and prediction intervals are centered around the same point (a predicted value of \(897001.9\) for price when distance equals 20), but the latter are substantially wider.
We will now plot price and distance along with the least squares regression line using the plot() and abline() functions.
There is some evidence for non-linearity in the relationship between distance and price. We will explore this issue later in this lab.
Next we examine some diagnostic plots, several of which were discussed in Section 3.3.3. Four diagnostic plots are automatically produced by applying the plot() function directly to the output from lm(). In general, this command will produce one plot at a time, and hitting Enter will generate the next plot. However, it is often convenient to view all four plots together. We can achieve this by using the par() and mfrow() functions, which tell R to split the display screen into separate panels so that multiple plots can be viewed simultaneously. For example, par(mfrow = c(2, 2)) divides the plotting region into a \(2 \times 2\) grid of panels.
Alternatively, we can compute the residuals from a linear regression fit using the residuals() function. The function rstudent() will return the studentized residuals, and we can use this function to plot the residuals against the fitted values.
On the basis of the residual plots, there is some evidence of non-linearity. Leverage statistics can be computed for any number of predictors using the hatvalues() function.
## 12633
## 12633
The which.max() function identifies the index of the largest element of a vector. In this case, it tells us which observation has the largest leverage statistic.
In order to fit a multiple linear regression model using least squares, we again use the lm() function. The syntax {} is used to fit a model with three predictors, x1, x2, and x3. The summary() function now outputs the regression coefficients for all the predictors.
##
## Call:
## lm(formula = price ~ distance + rooms, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3040188 -319587 -95314 207541 9719578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 341848.5 12681.8 26.96 <2e-16 ***
## distance -36913.0 553.3 -66.72 <2e-16 ***
## rooms 381540.2 4014.0 95.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 530700 on 20990 degrees of freedom
## Multiple R-squared: 0.3397, Adjusted R-squared: 0.3396
## F-statistic: 5398 on 2 and 20990 DF, p-value: < 2.2e-16
The housing data set contains 13 variables, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the following short-hand:
##
## Call:
## lm(formula = price ~ ., data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1742844 -214484 -40073 149718 8989318
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1.338e+08 1.700e+07 -7.872
## typet -3.423e+05 1.088e+04 -31.456
## typeu -5.335e+05 9.116e+03 -58.519
## methodS 7.374e+04 8.887e+03 8.297
## methodSA 2.054e+04 3.391e+04 0.606
## methodSP 3.725e+04 1.136e+04 3.280
## methodVB 2.571e+04 1.204e+04 2.136
## council_areaBayside City Council 4.446e+05 3.167e+04 14.038
## council_areaBoroondara City Council 5.597e+05 1.870e+04 29.937
## council_areaBrimbank City Council -4.211e+05 3.007e+04 -14.004
## council_areaCardinia Shire Council 2.905e+05 1.046e+05 2.778
## council_areaCasey City Council -5.056e+03 5.938e+04 -0.085
## council_areaDarebin City Council -4.412e+04 1.739e+04 -2.537
## council_areaFrankston City Council 3.503e+04 6.655e+04 0.526
## council_areaGlen Eira City Council 1.169e+05 2.707e+04 4.317
## council_areaGreater Dandenong City Council -9.049e+04 4.676e+04 -1.935
## council_areaHobsons Bay City Council -2.483e+05 3.356e+04 -7.398
## council_areaHume City Council -2.117e+05 2.531e+04 -8.364
## council_areaKingston City Council -6.807e+03 3.820e+04 -0.178
## council_areaKnox City Council -2.683e+04 3.737e+04 -0.718
## council_areaMacedon Ranges Shire Council 8.406e+05 9.040e+04 9.299
## council_areaManningham City Council 1.701e+05 2.145e+04 7.931
## council_areaMaribyrnong City Council -2.989e+05 2.787e+04 -10.725
## council_areaMaroondah City Council 1.869e+05 3.380e+04 5.531
## council_areaMelbourne City Council 6.867e+04 2.472e+04 2.778
## council_areaMelton City Council -3.404e+05 5.260e+04 -6.471
## council_areaMitchell Shire Council 8.215e+05 1.337e+05 6.145
## council_areaMonash City Council 1.019e+05 2.630e+04 3.875
## council_areaMoonee Valley City Council -4.699e+04 2.399e+04 -1.959
## council_areaMoorabool Shire Council 1.145e+05 2.096e+05 0.546
## council_areaMoreland City Council -1.041e+05 2.076e+04 -5.014
## council_areaNillumbik Shire Council 1.173e+05 5.515e+04 2.126
## council_areaPort Phillip City Council 2.344e+05 2.723e+04 8.607
## council_areaStonnington City Council 5.063e+05 2.560e+04 19.777
## council_areaWhitehorse City Council 1.710e+05 2.763e+04 6.189
## council_areaWhittlesea City Council -9.160e+04 2.318e+04 -3.952
## council_areaWyndham City Council -7.693e+05 5.133e+04 -14.986
## council_areaYarra City Council 2.893e+04 2.445e+04 1.183
## council_areaYarra Ranges Shire Council 3.103e+05 6.107e+04 5.081
## rooms 1.723e+05 4.408e+03 39.090
## distance -3.595e+04 1.032e+03 -34.831
## bathroom 1.815e+05 5.219e+03 34.774
## car 3.432e+04 3.029e+03 11.331
## lattitude -1.308e+06 1.176e+05 -11.126
## longtitude -5.202e+05 9.335e+04 -5.573
## year 7.952e+04 5.580e+03 14.250
## month 8.545e+03 1.064e+03 8.034
## day 8.010e+02 3.323e+02 2.411
## Pr(>|t|)
## (Intercept) 3.65e-15 ***
## typet < 2e-16 ***
## typeu < 2e-16 ***
## methodS < 2e-16 ***
## methodSA 0.544820
## methodSP 0.001041 **
## methodVB 0.032718 *
## council_areaBayside City Council < 2e-16 ***
## council_areaBoroondara City Council < 2e-16 ***
## council_areaBrimbank City Council < 2e-16 ***
## council_areaCardinia Shire Council 0.005475 **
## council_areaCasey City Council 0.932145
## council_areaDarebin City Council 0.011179 *
## council_areaFrankston City Council 0.598584
## council_areaGlen Eira City Council 1.59e-05 ***
## council_areaGreater Dandenong City Council 0.052965 .
## council_areaHobsons Bay City Council 1.43e-13 ***
## council_areaHume City Council < 2e-16 ***
## council_areaKingston City Council 0.858583
## council_areaKnox City Council 0.472755
## council_areaMacedon Ranges Shire Council < 2e-16 ***
## council_areaManningham City Council 2.29e-15 ***
## council_areaMaribyrnong City Council < 2e-16 ***
## council_areaMaroondah City Council 3.22e-08 ***
## council_areaMelbourne City Council 0.005471 **
## council_areaMelton City Council 9.96e-11 ***
## council_areaMitchell Shire Council 8.16e-10 ***
## council_areaMonash City Council 0.000107 ***
## council_areaMoonee Valley City Council 0.050126 .
## council_areaMoorabool Shire Council 0.585033
## council_areaMoreland City Council 5.37e-07 ***
## council_areaNillumbik Shire Council 0.033477 *
## council_areaPort Phillip City Council < 2e-16 ***
## council_areaStonnington City Council < 2e-16 ***
## council_areaWhitehorse City Council 6.17e-10 ***
## council_areaWhittlesea City Council 7.79e-05 ***
## council_areaWyndham City Council < 2e-16 ***
## council_areaYarra City Council 0.236838
## council_areaYarra Ranges Shire Council 3.78e-07 ***
## rooms < 2e-16 ***
## distance < 2e-16 ***
## bathroom < 2e-16 ***
## car < 2e-16 ***
## lattitude < 2e-16 ***
## longtitude 2.54e-08 ***
## year < 2e-16 ***
## month 9.93e-16 ***
## day 0.015934 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 399000 on 20945 degrees of freedom
## Multiple R-squared: 0.6275, Adjusted R-squared: 0.6266
## F-statistic: 750.6 on 47 and 20945 DF, p-value: < 2.2e-16
We can access the individual components of a summary object by name (type ?summary.lm to see what is available). Hence summary(lm.fit)$r.sq gives us the \(R^2\), and summary(lm.fit)$sigma gives us the RSE. The vif() function, part of the car package, can be used to compute variance inflation factors. Most VIF’s are low to moderate for this data. The car package is not part of the base R installation so it must be downloaded the first time you use it via the install.packages() function in R.
## GVIF Df GVIF^(1/(2*Df))
## type 1.602045 2 1.125042
## method 1.094008 4 1.011294
## council_area 1592.602432 32 1.122104
## rooms 2.310830 1 1.520141
## distance 6.670397 1 2.582711
## bathroom 1.758557 1 1.326106
## car 1.259841 1 1.122426
## lattitude 15.301284 1 3.911686
## longtitude 16.733732 1 4.090688
## year 1.619200 1 1.272478
## month 1.402228 1 1.184157
## day 1.010756 1 1.005364
What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, some categories in council_area have a high \(p\)-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except council_area.
##
## Call:
## lm(formula = price ~ . - council_area, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2158870 -244140 -74349 142816 9283678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.818e+08 1.306e+07 -21.567 < 2e-16 ***
## typet -3.293e+05 1.223e+04 -26.920 < 2e-16 ***
## typeu -4.582e+05 1.016e+04 -45.102 < 2e-16 ***
## methodS 5.450e+04 1.002e+04 5.438 5.46e-08 ***
## methodSA 1.116e+04 3.832e+04 0.291 0.7708
## methodSP -2.011e+04 1.277e+04 -1.574 0.1155
## methodVB 6.228e+04 1.359e+04 4.584 4.59e-06 ***
## rooms 1.717e+05 4.931e+03 34.810 < 2e-16 ***
## distance -4.578e+04 5.133e+02 -89.189 < 2e-16 ***
## bathroom 2.268e+05 5.813e+03 39.012 < 2e-16 ***
## car 3.750e+04 3.381e+03 11.093 < 2e-16 ***
## lattitude -1.619e+06 3.666e+04 -44.174 < 2e-16 ***
## longtitude 8.568e+05 2.815e+04 30.434 < 2e-16 ***
## year 4.809e+04 6.176e+03 7.786 7.26e-15 ***
## month 6.102e+03 1.196e+03 5.102 3.38e-07 ***
## day 8.004e+02 3.760e+02 2.129 0.0333 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 451800 on 20977 degrees of freedom
## Multiple R-squared: 0.5218, Adjusted R-squared: 0.5214
## F-statistic: 1526 on 15 and 20977 DF, p-value: < 2.2e-16
Alternatively, the update() function can be used.
It is easy to include interaction terms in a linear model using the lm() function. The syntax distance:rooms tells R to include an interaction term between distance and rooms. The syntax distance * rooms simultaneously includes distance, rooms, and the interaction term distance\(\times\)rooms as predictors; it is a shorthand for distance + rooms + distance:rooms. %We can also pass in transformed versions of the predictors.
##
## Call:
## lm(formula = price ~ distance * rooms, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2617401 -319640 -80357 192369 9697299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -170622.1 22496.6 -7.584 3.48e-14 ***
## distance 16286.9 2018.9 8.067 7.58e-16 ***
## rooms 549368.3 7292.5 75.333 < 2e-16 ***
## distance:rooms -16551.5 604.9 -27.361 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 521500 on 20989 degrees of freedom
## Multiple R-squared: 0.3624, Adjusted R-squared: 0.3623
## F-statistic: 3976 on 3 and 20989 DF, p-value: < 2.2e-16
The lm() function can also accommodate non-linear transformations of the predictors. For instance, given a predictor \(X\), we can create a predictor \(X^2\) using I(X^2). The function I() is needed since the ^ has a special meaning in a formula object; wrapping as we do allows the standard uscouncil_area in R, which is to raise X to the power 2. We now perform a regression of price onto distance and distance^2.
##
## Call:
## lm(formula = price ~ distance + I(distance^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1119220 -383250 -144346 233893 10096290
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1370325.00 13313.01 102.931 < 2e-16 ***
## distance -26831.27 1826.16 -14.693 < 2e-16 ***
## I(distance^2) 137.09 51.86 2.643 0.00821 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 634600 on 20990 degrees of freedom
## Multiple R-squared: 0.05572, Adjusted R-squared: 0.05563
## F-statistic: 619.3 on 2 and 20990 DF, p-value: < 2.2e-16
The near-zero \(p\)-value associated with the quadratic term suggests that it leads to an improved model. We use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit.
## Analysis of Variance Table
##
## Model 1: price ~ distance
## Model 2: price ~ distance + I(distance^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 20991 8.4559e+15
## 2 20990 8.4531e+15 1 2.814e+12 6.9874 0.008215 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here Model 1 represents the linear submodel containing only one predictor, distance, while Model 2 corresponds to the larger quadratic model that has two predictors, distance and distance^2. The anova() function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior. Here the \(F\)-statistic is \(6.9874\) and the associated \(p\)-value is virtually zero. This provides very clear evidence that the model containing the predictors distance and distance^2 is superior to the model that only contains the predictor distance. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between price and distance. If we type
then we see that when the distance^2 term is included in the model, there is little discernible pattern in the residuals.
In order to create a cubic fit, we can include a predictor of the form I(X^3). However, this approach can start to get cumbersome for higher-order polynomials. A better approach involves using the poly() function to create the polynomial within lm(). For example, the following command produces a fifth-order polynomial fit:
##
## Call:
## lm(formula = price ~ poly(distance, 5))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1162602 -386228 -136505 237029 10053259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1089746 4361 249.898 < 2e-16 ***
## poly(distance, 5)1 -22271737 631830 -35.250 < 2e-16 ***
## poly(distance, 5)2 1677491 631830 2.655 0.00794 **
## poly(distance, 5)3 6792303 631830 10.750 < 2e-16 ***
## poly(distance, 5)4 -4993044 631830 -7.903 2.87e-15 ***
## poly(distance, 5)5 1960455 631830 3.103 0.00192 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 631800 on 20987 degrees of freedom
## Multiple R-squared: 0.06409, Adjusted R-squared: 0.06387
## F-statistic: 287.4 on 5 and 20987 DF, p-value: < 2.2e-16
This suggests that including additional polynomial terms, up to fifth order, leads to an improvement in the model fit! However, further investigation of the data reveals that no polynomial terms beyond fifth order have significant \(p\)-values in a regression fit.
By default, the poly() function orthogonalizes the predictors: this means that the features output by this function are not simply a sequence of powers of the argument. However, a linear model applied to the output of the poly() function will have the same fitted values as a linear model applied to the raw polynomials (although the coefficient estimates, standard errors, and p-values will differ). In order to obtain the raw polynomials from the poly() function, the argument raw = TRUE must be used.
Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a square root transformation.
##
## Call:
## lm(formula = price ~ sqrt(distance), data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1191581 -379263 -147030 235353 10112362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1578581 15190 103.92 <2e-16 ***
## sqrt(distance) -151508 4507 -33.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 636100 on 20991 degrees of freedom
## Multiple R-squared: 0.05109, Adjusted R-squared: 0.05104
## F-statistic: 1130 on 1 and 20991 DF, p-value: < 2.2e-16
We will now examine the Carseats data, which is part of the ISLR2 library. We will attempt to predict Sales (child car seat sales) in \(400\) locations based on a number of predictors.
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
The Carseats data includes qualitative predictors such as shelveloc, an indicator of the quality of the shelving location—that is, the space within a store in which the car seat is displayed—at each location. The predictor shelveloc takes on three possible values: Bad, Medium, and Good. Given a qualitative variable such as shelveloc, R generates dummy variables automatically. Below we fit a multiple regression model that includes some interaction terms.
##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
Using gtsummary library to format summary table
| Characteristic | Beta | 95% CI1 | p-value |
|---|---|---|---|
| (Intercept) | 6.6 | 4.6, 8.6 | <0.001 |
| CompPrice | 0.09 | 0.08, 0.10 | <0.001 |
| Income | 0.01 | 0.01, 0.02 | <0.001 |
| Advertising | 0.07 | 0.03, 0.11 | 0.002 |
| Population | 0.00 | 0.00, 0.00 | 0.7 |
| Price | -0.10 | -0.12, -0.09 | <0.001 |
| ShelveLoc | |||
| Bad | — | — | |
| Good | 4.8 | 4.5, 5.1 | <0.001 |
| Medium | 2.0 | 1.7, 2.2 | <0.001 |
| Age | -0.06 | -0.09, -0.03 | <0.001 |
| Education | -0.02 | -0.06, 0.02 | 0.3 |
| Urban | |||
| No | — | — | |
| Yes | 0.14 | -0.08, 0.36 | 0.2 |
| US | |||
| No | — | — | |
| Yes | -0.16 | -0.45, 0.14 | 0.3 |
| Income * Advertising | 0.00 | 0.00, 0.00 | 0.007 |
| Price * Age | 0.00 | 0.00, 0.00 | 0.4 |
|
1
CI = Confidence Interval
|
|||
The contrasts() function returns the coding that R uses for the dummy variables.
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
Use ?contrasts to learn about other contrasts, and how to set them.
R has created a ShelveLocGood dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLocMedium dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables. The fact that the coefficient for ShelveLocGood in the regression output is positive indicates that a good shelving location is associated with high sales (relative to a bad location). And ShelveLocMedium has a smaller positive coefficient, indicating that a medium shelving location is associated with higher sales than a bad shelving location but lower sales than a good shelving location.
As we have seen, R comes with many useful functions, and still more functions are available by way of R libraries. However, we will often be interested in performing an operation for which no function is available. In this setting, we may want to write our own function. For instance, below we provide a simple function that reads in the ISLR2 and MASS libraries, called LoadLibraries(). Before we have created the function, R returns an error if we try to call it.
## Error in eval(expr, envir, enclos): object 'LoadLibraries' not found
## Error in LoadLibraries(): could not find function "LoadLibraries"
We now create the function. Note that the + symbols are printed by R and should not be typed in. The { symbol informs R that multiple commands are about to be input. Hitting Enter after typing { will cause R to print the + symbol. We can then input as many commands as we wish, hitting {Enter} after each one. Finally the } symbol informs R that no further commands will be entered.
LoadLibraries <- function() {
library(ISLR2)
library(MASS)
print("The libraries have been loaded.")
}Now if we type in LoadLibraries, R will tell us what is in the function.
## function() {
## library(ISLR2)
## library(MASS)
## print("The libraries have been loaded.")
## }
If we call the function, the libraries are loaded in and the print statement is output.
## [1] "The libraries have been loaded."